Add Serializable ABC for Python#5139
Conversation
|
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
This provides a generic interface for handling serialization in Python. Handles serialization with Dask through the `"cuda"` and `"dask"` serializers. Also implements pickle serialization using `__reduce_ex__`. For Python versions with support for pickle's protocol 5, the class also supports out-of-band buffers for more efficient serialization. Subclasses are responsible for implementing `serialize` and `deserialize` to/from a Python `dict` with the `header` and a collection of `frames`. The abstract base class handles all other serialization using these two methods.
01dff21 to
e37b4a1
Compare
|
|
||
| @classmethod | ||
| def device_deserialize(cls, header, frames): | ||
| for f in frames: |
There was a problem hiding this comment.
This isn't going to be true with pack/unpack serialization which packs into a single host buffer that stores metadata, and a single device buffer that stores data.
There was a problem hiding this comment.
Yep, that makes sense. Happy to change this as we see fit.
The bigger idea here is that when we make these changes we can now do them in one place. So hopefully that makes pack/unpack and other changes in the future easier :)
There was a problem hiding this comment.
Agreed. I think then we can leave this as is for now. :)
There was a problem hiding this comment.
Added PR ( #5309 ), which should handle a mixture of host and device frames.
|
cc @Salonijain27 @hcho3 (who have also been thinking about serialization in the cuML and XGBoost contexts recently 🙂) |
Since `memoryview`s are not generally pickleable, coerce them to NumPy arrays, which are, when handling pickling with older pickle protocols. For pickle protocol 5+, stick with coercing them to `PickleBuffer`s, which pickle knows how to serialize out-of-band.
|
Just as a note, we should be able to drop |
As `Index` already subclasses `Serializable`, there is no need for `MultiIndex` to subclass `Serializable` as well. Plus it seems to be confusing Python. So drop it.
Serializable ABC for PythonSerializable ABC for Python
Serializable ABC for PythonSerializable ABC for Python
|
|
||
| # all (de-)serializations are attached to cudf Objects: | ||
| # Series/DataFrame/Index/Column/Buffer/etc | ||
| serializable_classes = ( |
There was a problem hiding this comment.
+1 one from me as well! I especially like moving away from this list to anything which implements Serializable
|
cc @pentschev so he is aware as well |
pentschev
left a comment
There was a problem hiding this comment.
Avoiding code duplication is always nice, thanks for doing that @jakirkham !
|
Thanks all! 😄 |
Handling in PR ( rapidsai/cuml#2281 ). |
Opening this as WIP, RFC PR to facilitate discussion. There are a few different serialization efforts. So hopefully we can figure out how this would fit in.
Adds an abstract base class
Serializable, which pulls together some serialization logic in the cuDF Python codebase in one spot. Subclasses are responsible for implementing aserializemethod, which breaks an object down into aheader(Pythondict) andframes(Buffers), and adeserializemethod, which takes theheaderandframesto produce a new object. In return theSerializableclass provides a method for host serialization, device serialization, and pickling the object (using out-of-band buffers with pickle's protocol 5 when available). Also theSerializableclass is registered with Dask for serialization. So any class that subclassesSerializableis automatically registered with Dask serialization as well.cc @kkraus14 @shwina @quasiben